KAMD: A Progress Estimator for MapReduce Pipelines

Authors

  • Kristi Morton
  • Abe Friesen
Abstract

Limited user feedback exists in cluster computing environments such as MapReduce. Accurate, time-oriented progress indicators could provide much utility to users in this domain, where job execution times can have high variance due to the amount of data being processed, the amount of parallelism available, and the types of operators (often user-defined) that perform the processing. This feedback would help users make informed decisions, such as whether a job should be terminated and restarted at a later time when the cluster has more resources available. However, none of the techniques used by existing tools or available in the literature provide a non-trivial progress indicator for queries running in a distributed environment. In this paper, we apply recently developed techniques for estimating the progress of single-site SQL queries to parallel environments. In particular, we target environments where queries consist of MapReduce job pipelines. We also present techniques that improve the accuracy and usefulness of progress estimators operating in this environment. We implemented our estimators in the Pig system and demonstrate their performance in experiments with real data (search logs) on a real cluster.

Similar resources

Parallelizing XML Processing Pipelines via MapReduce

We present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the MapReduce framework. Pipelines in our approach consist of sequences of processing steps that consume XML-structured data and produce, often through calls to “black-box” functions, modified (i.e., updated) XML structures. Our main contributions are a set of strategies for...

Parallelizing XML data-streaming workflows via MapReduce

In prior work it has been shown that the design of scientific workflows can benefit from a collection-oriented modeling paradigm which views scientific workflows as pipelines of XML stream processors. In this paper, we present approaches for exploiting data parallelism in XML processing pipelines through novel compilation strategies to the Map-Reduce framework. Pipelines in our approach consist...

Cloudflow - enabling faster biomedical pipelines with MapReduce and Spark

For many years Apache Hadoop has been used as a synonym for processing data in the MapReduce fashion. However, due to the complexity of developing MapReduce applications, adoption of this paradigm in genetics has been limited. To alleviate some of the issues, we have previously developed Cloudflow, a high-level pipeline framework that allows users to create sophisticated biomedical pipelines usi...

Halt or Continue: Estimating Progress of Queries in the Cloud

With cloud-based data management gaining more ground by the day, the problem of estimating the progress of MapReduce queries in the cloud is of paramount importance. This problem is challenging to solve for two reasons: i) the cloud is typically a large-scale heterogeneous environment, which requires progress estimation to be tailored to non-uniform hardware characteristics, and ii) cloud is often built wit...

(Title unavailable: labels from a figure of a MapReduce map task and reduce task were extracted in its place)

In parallel query-processing environments, accurate, time-oriented progress indicators could provide much utility to users given that queries take a very long time to complete and both inter- and intra-query execution times can have high variance. In these systems, query times depend on the query plans and the amount of data being processed, but also on the amount of parallelism available, the ty...

Publication date: 2009